Stock Market Analysis and Prediction¶
This project analyzes historical stock prices from the Nairobi Securities Exchange (2007-2012) to understand patterns, predict future movements, and support data-driven investment decisions. It focuses on technical analysis, which studies historical market data such as prices, volumes, and trends. Specifically, the goal is to predict the closing price of stocks from key predictors.
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from matplotlib import pyplot as plt
%matplotlib inline
# HTML export
import plotly.io as pio
pio.renderers.default = 'notebook'
# import our data
df_2007 = pd.read_csv(r'Nairobi Securities Exchange All Stocks Prices 2007-2012\NSE_data_all_stocks_2007.csv')
df_2008 = pd.read_csv(r'Nairobi Securities Exchange All Stocks Prices 2007-2012\NSE_data_all_stocks_2008.csv')
df_2009 = pd.read_csv(r'Nairobi Securities Exchange All Stocks Prices 2007-2012\NSE_data_all_stocks_2009.csv')
df_2010 = pd.read_csv(r'Nairobi Securities Exchange All Stocks Prices 2007-2012\NSE_data_all_stocks_2010.csv')
df_2011 = pd.read_csv(r'Nairobi Securities Exchange All Stocks Prices 2007-2012\NSE_data_all_stocks_2011.csv')
df_2012 = pd.read_csv(r'Nairobi Securities Exchange All Stocks Prices 2007-2012\NSE_data_all_stocks_2012.csv')
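The six nearly identical read_csv calls can be collapsed into a loop; a sketch assuming the same folder layout as the paths above:

```python
# The six read_csv cells above, expressed as a loop (folder and file names
# taken from the paths used in this notebook).
base = r'Nairobi Securities Exchange All Stocks Prices 2007-2012'
paths = {year: base + rf'\NSE_data_all_stocks_{year}.csv'
         for year in range(2007, 2013)}
# frames = {year: pd.read_csv(path) for year, path in paths.items()}
```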
Columns and Descriptions¶
- DATE: The date of the stock information.
- CODE: The stock or index code (e.g., ^NASI for NSE All-Share Index).
- NAME: The name of the stock or index.
- 12m Low: The lowest price in the last 12 months.
- 12m High: The highest price in the last 12 months.
- Day Low: The lowest price for the day.
- Day High: The highest price for the day.
- Day Price: The closing price for the day.
- Previous: The previous closing price.
- Change: The difference between the day's closing price and the previous closing price.
- Change%: The percentage change from the previous closing price.
- Volume: The number of shares traded during the day.
- Adjust: Adjustment data for the stock (e.g., for splits or dividends).
df_2007
| DATE | CODE | NAME | 12m Low | 12m High | Day Low | Day High | Day Price | Previous | Change | Change% | Volume | Adjust | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/2/2007 | EGAD | Eaagads | 22 | 57 | 52 | 52 | 52 | 52 | - | - | 300 | 26 |
| 1 | 1/2/2007 | KAPC | Kapchorua Tea | 111 | 148 | 100 | 100 | 100 | 98.5 | 1.5 | 1.50% | 7,800 | - |
| 2 | 1/2/2007 | KUKZ | Kakuzi | 67.5 | 89 | 42.5 | 45 | 43.5 | 42.25 | 1.25 | 2.87% | 2,700 | - |
| 3 | 1/2/2007 | REA | Rea Vipingo Plantations | 14.5 | 23.5 | 25 | 26.5 | 25.5 | 25.75 | -0.25 | 0.98% | 10,500 | - |
| 4 | 1/2/2007 | SASN | Sasini Tea and Coffee | 10.5 | 13.6 | 140 | 145 | 140 | 141 | -1 | 0.71% | 70,900 | 23.33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10603 | 12/24/2007 | EVRD | Eveready East Africa | 1.4 | 3 | 7 | 8.1 | 7.95 | 7.65 | 0.3 | 3.77% | 84,200 | - |
| 10604 | 12/24/2007 | MSC | Mumias Sugar | 4 | 7.35 | 13.7 | 15 | 14.8 | 14.3 | 0.5 | 3.38% | 2,400,000 | - |
| 10605 | 12/24/2007 | UNGA | Unga Group | 9.6 | 16.5 | 15 | 15.5 | 15.45 | 15.5 | -0.05 | 0.32% | 20,300 | 12.88 |
| 10606 | 12/24/2007 | ACCS | AccessKenya Group | 3.9 | 10.1 | 21.5 | 24.75 | 23.25 | 22.5 | 0.75 | 3.23% | 1,090,000 | - |
| 10607 | 12/24/2007 | ^N20I | NSE 20-Share Index | 3,396.83 | 5,030.91 | 5,444.83 | 5,444.83 | 5,444.83 | 5,339.75 | 105.08 | 1.93% | 12,530,000 | - |
10608 rows × 13 columns
df_2008
| Date | CODE | NAME | 12m Low | 12m High | Day Low | Day High | Day Price | Previous | Change | Change% | Volume | Adjusted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/2/2008 | ^N20I | NSE 20-Share Index | 3,396.83 | 5,030.91 | 5,167.18 | 5,167.18 | 5,167.18 | 5,444.83 | -277.65 | 5.37% | 4,790,000 | - |
| 1 | 1/2/2008 | ACCS | AccessKenya Group | 3.9 | 10.1 | 21 | 25 | 21.25 | 23.25 | -2 | 9.41% | 538,100 | - |
| 2 | 1/2/2008 | UNGA | Unga Group | 9.6 | 16.5 | 15 | 15 | 15 | 15.45 | -0.45 | 3.00% | 25,000 | 12.5 |
| 3 | 1/2/2008 | MSC | Mumias Sugar | 4 | 7.35 | 13.35 | 14.2 | 13.6 | 14.8 | -1.2 | 8.82% | 617,200 | - |
| 4 | 1/2/2008 | EVRD | Eveready East Africa | 1.4 | 3 | 7.2 | 8 | 7.7 | 7.95 | -0.25 | 3.25% | 23,900 | - |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10578 | 12/31/2008 | C&G | Car and General (K) | 21 | 29 | 44 | 44 | 44 | 44 | - | - | 200 | 29.33 |
| 10579 | 12/31/2008 | WTK | Williamson Tea Kenya | 180 | 290 | 57.5 | 57.5 | 57.5 | 52.5 | 5 | 8.70% | 1,000 | - |
| 10580 | 12/31/2008 | SASN | Sasini Tea and Coffee | 10.5 | 13.6 | 7 | 7.35 | 7 | 6.7 | 0.3 | 4.29% | 26,700 | - |
| 10581 | 12/31/2008 | REA | Rea Vipingo Plantations | 14.5 | 23.5 | 13.5 | 14 | 13.95 | 13.65 | 0.3 | 2.15% | 1,100 | - |
| 10582 | 12/31/2008 | KAPC | Kapchorua Tea | 111 | 148 | 68 | 68 | 68 | 70 | -2 | 2.94% | 500 | - |
10583 rows × 13 columns
df_2009
| DATE | CODE | NAME | 12m Low | 12m High | Day Low | Day High | Day Price | Previous | Change | Change% | Volume | Adjust | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/2/2009 | ^N20I | NSE 20-Share Index | 3,396.83 | 5,030.91 | 3,589.16 | 3,589.16 | 3,589.16 | 3,521.18 | 67.98 | 1.89% | 3,750,000 | - |
| 1 | 1/2/2009 | ^NASI | NSE All-Share Index | 73.71 | 121.58 | 74.2 | 74.2 | 74.2 | 73.37 | 0.83 | 1.12% | 3,750,000 | - |
| 2 | 1/2/2009 | SCOM | Safaricom | 3.05 | 6.5 | 3.6 | 3.75 | 3.6 | 3.6 | - | - | 1,220,000 | - |
| 3 | 1/2/2009 | ACCS | AccessKenya Group | 3.9 | 10.1 | 21.25 | 22.75 | 22.25 | 20.75 | 1.5 | 6.74% | 200,700 | - |
| 4 | 1/2/2009 | UNGA | Unga Group | 9.6 | 16.5 | 12.85 | 13.6 | 12.9 | 13.6 | -0.7 | 5.43% | 4,500 | - |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14381 | 12/31/2009 | REA | Rea Vipingo Plantations | 14.5 | 23.5 | 11.5 | 11.5 | 11.5 | 11.25 | 0.25 | 2.17% | 200 | - |
| 14382 | 12/31/2009 | LIMT | Limuru Tea | 400 | 500 | 305 | 305 | 305 | 305 | - | - | 0 | - |
| 14383 | 12/31/2009 | KUKZ | Kakuzi | 67.5 | 89 | 31.75 | 31.75 | 31.75 | 31.75 | - | - | 0 | - |
| 14384 | 12/31/2009 | KAPC | Kapchorua Tea | 111 | 148 | 86 | 86 | 86 | 86 | - | - | 0 | - |
| 14385 | 12/31/2009 | EGAD | Eaagads | 22 | 57 | 20 | 20 | 20 | 20 | - | - | 0 | - |
14386 rows × 13 columns
df_2010
| DATE | CODE | NAME | 12m Low | 12m High | Day Low | Day High | Day Price | Previous | Change | Change% | Volume | Adjust | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/4/2010 | ^NASI | NSE All-Share Index | 74.68 | 122.14 | 71.89 | 71.89 | 71.89 | 71.64 | 0.25 | 0.35% | 3,730,000 | - |
| 1 | 1/4/2010 | ^N20I | NSE 20-Share Index | 3,443.94 | 5,030.91 | 3,261.17 | 3,261.17 | 3,261.17 | 3,247.44 | 13.73 | 0.42% | 3,730,000 | - |
| 2 | 1/4/2010 | KPLC-P7 | Kenya Power and Lighting 7% | 5.5 | 5.5 | 5.5 | 5.5 | 5.5 | 5.5 | - | - | 0 | - |
| 3 | 1/4/2010 | KPLC-P4 | Kenya Power and Lighting 4% | 7.65 | 7.65 | 7.65 | 7.65 | 7.65 | 7.65 | - | - | 0 | - |
| 4 | 1/4/2010 | SCOM | Safaricom | 3.1 | 6.6 | 4.55 | 4.65 | 4.55 | 4.55 | - | - | 2,800,000 | - |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14947 | 12/31/2010 | REA | Rea Vipingo Plantations | 14.5 | 23.5 | 15.8 | 17.5 | 17.45 | 17 | 0.45 | 2.58% | 155,500 | - |
| 14948 | 12/31/2010 | LIMT | Limuru Tea | 400 | 500 | 300 | 300 | 300 | 300 | - | - | 0 | - |
| 14949 | 12/31/2010 | KUKZ | Kakuzi | 67.5 | 89 | 80 | 82 | 81.5 | 77 | 4.5 | 5.52% | 67,500 | - |
| 14950 | 12/31/2010 | KAPC | Kapchorua Tea | 111 | 148 | 100 | 100 | 100 | 100 | - | - | 0 | - |
| 14951 | 12/31/2010 | EGAD | Eaagads | 22 | 57 | 50 | 50 | 50 | 50 | - | - | 0 | - |
14952 rows × 13 columns
df_2011
| DATE | CODE | NAME | 12m Low | 12m High | Day Low | Day High | Day Price | Previous | Change | Change% | Volume | Adjust | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/3/2011 | EGAD | Eaagads | 22 | 57 | 50 | 53.5 | 52.5 | 50 | 2.5 | 4.76% | 1,200 | - |
| 1 | 1/3/2011 | KAPC | Kapchorua Tea | 111 | 148 | 100 | 100 | 100 | 100 | - | - | 0 | - |
| 2 | 1/3/2011 | KUKZ | Kakuzi | 67.5 | 89 | 81.5 | 81.5 | 81.5 | 81.5 | - | - | 0 | - |
| 3 | 1/3/2011 | LIMT | Limuru Tea | 400 | 500 | 300 | 300 | 300 | 300 | - | - | 0 | - |
| 4 | 1/3/2011 | REA | Rea Vipingo Plantations | 14.5 | 23.5 | 17 | 18.55 | 18.45 | 17.45 | 1 | 5.42% | 25,000 | - |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14926 | 12/30/2011 | SCOM | Safaricom | 3.05 | 6.5 | 2.9 | 3.05 | 2.95 | 3 | -0.05 | 1.69% | 1,590,000 | - |
| 14927 | 12/30/2011 | KPLC-P4 | Kenya Power and Lighting 4% | 7.65 | 7.65 | 7.65 | 7.65 | 7.65 | 7.65 | - | - | 0 | - |
| 14928 | 12/30/2011 | KPLC-P7 | Kenya Power and Lighting 7% | 5.5 | 5.5 | 5.5 | 5.5 | 5.5 | 5.5 | - | - | 0 | - |
| 14929 | 12/30/2011 | ^N20I | NSE 20-Share Index | 3,396.83 | 5,030.91 | 3,205.02 | 3,205.02 | 3,205.02 | 3,160.03 | 44.99 | 1.40% | 4,710,000 | - |
| 14930 | 12/30/2011 | ^NASI | NSE All-Share Index | 73.71 | 121.58 | 68.03 | 68.03 | 68.03 | 67.85 | 0.18 | 0.26% | 4,710,000 | - |
14931 rows × 13 columns
df_2012
| DATE | CODE | NAME | 12m Low | 12m High | Day Low | Day High | Day Price | Previous | Change | Change% | Volume | Adjust | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/3/2012 | ^NASI | NSE All-Share Index | 73.23 | 119.37 | 68.38 | 68.38 | 68.38 | 68.03 | 0.35 | 0.51% | 10,990,000 | - |
| 1 | 1/3/2012 | ^N20I | NSE 20-Share Index | 3,312.56 | 4,985.91 | 3,212.86 | 3,212.86 | 3,212.86 | 3,205.02 | 7.84 | 0.24% | 10,990,000 | - |
| 2 | 1/3/2012 | KPLC-P7 | Kenya Power and Lighting 7% | 5.5 | 5.5 | 5.5 | 5.5 | 5.5 | 5.5 | - | - | 0 | - |
| 3 | 1/3/2012 | KPLC-P4 | Kenya Power and Lighting 4% | 7.65 | 7.65 | 7.65 | 7.65 | 7.65 | 7.65 | - | - | 0 | - |
| 4 | 1/3/2012 | SCOM | Safaricom | 3.05 | 6.5 | 2.9 | 3 | 2.95 | 2.95 | - | - | 7,410,000 | - |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16039 | 12/31/2012 | REA | Rea Vipingo Plantations | 14.5 | 23.5 | 19 | 19 | 19 | 19 | - | - | 1,000 | - |
| 16040 | 12/31/2012 | LIMT | Limuru Tea | 400 | 500 | 430 | 430 | 430 | 430 | - | - | 0 | - |
| 16041 | 12/31/2012 | KUKZ | Kakuzi | 67.5 | 89 | 72 | 72 | 72 | 69 | 3 | 4.17% | 1,000 | - |
| 16042 | 12/31/2012 | KAPC | Kapchorua Tea | 111 | 148 | 118 | 118 | 118 | 118 | - | - | 100 | - |
| 16043 | 12/31/2012 | EGAD | Eaagads | 22 | 57 | 25 | 25 | 25 | 25 | - | - | 0 | - |
16044 rows × 13 columns
Preparing the dataset¶
# List of all dataframes
dfs = [df_2007, df_2008, df_2009, df_2010, df_2011, df_2012]
# Loop through each dataframe
for df in dfs:
# Check if the column 'Adjust' exists
if 'Adjust' in df.columns:
# Rename 'Adjust' to 'Adjusted' for consistency
df.rename(columns={'Adjust': 'Adjusted'}, inplace=True)
# Check if the column 'Date' exists
if 'Date' in df.columns:
# Rename 'Date' to 'DATE' for consistency
df.rename(columns={'Date': 'DATE'}, inplace=True)
# Combine all the dataframes from 2007 to 2012
combined_df = pd.concat([df_2007, df_2008, df_2009, df_2010, df_2011, df_2012])
# Reset the index to keep it clean
combined_df.reset_index(drop=True, inplace=True)
df = combined_df.copy()
df
| DATE | CODE | NAME | 12m Low | 12m High | Day Low | Day High | Day Price | Previous | Change | Change% | Volume | Adjusted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/2/2007 | EGAD | Eaagads | 22 | 57 | 52 | 52 | 52 | 52 | - | - | 300 | 26 |
| 1 | 1/2/2007 | KAPC | Kapchorua Tea | 111 | 148 | 100 | 100 | 100 | 98.5 | 1.5 | 1.50% | 7,800 | - |
| 2 | 1/2/2007 | KUKZ | Kakuzi | 67.5 | 89 | 42.5 | 45 | 43.5 | 42.25 | 1.25 | 2.87% | 2,700 | - |
| 3 | 1/2/2007 | REA | Rea Vipingo Plantations | 14.5 | 23.5 | 25 | 26.5 | 25.5 | 25.75 | -0.25 | 0.98% | 10,500 | - |
| 4 | 1/2/2007 | SASN | Sasini Tea and Coffee | 10.5 | 13.6 | 140 | 145 | 140 | 141 | -1 | 0.71% | 70,900 | 23.33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 81499 | 12/31/2012 | REA | Rea Vipingo Plantations | 14.5 | 23.5 | 19 | 19 | 19 | 19 | - | - | 1,000 | - |
| 81500 | 12/31/2012 | LIMT | Limuru Tea | 400 | 500 | 430 | 430 | 430 | 430 | - | - | 0 | - |
| 81501 | 12/31/2012 | KUKZ | Kakuzi | 67.5 | 89 | 72 | 72 | 72 | 69 | 3 | 4.17% | 1,000 | - |
| 81502 | 12/31/2012 | KAPC | Kapchorua Tea | 111 | 148 | 118 | 118 | 118 | 118 | - | - | 100 | - |
| 81503 | 12/31/2012 | EGAD | Eaagads | 22 | 57 | 25 | 25 | 25 | 25 | - | - | 0 | - |
81504 rows × 13 columns
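A quick way to confirm the concat lost nothing is to compare row counts before and after; a self-contained sketch of the pattern on toy frames:

```python
import pandas as pd

# Toy illustration of the concat + reset_index pattern used above.
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3]})
combined = pd.concat([a, b]).reset_index(drop=True)
assert len(combined) == len(a) + len(b)   # no rows lost or duplicated
assert list(combined.index) == [0, 1, 2]  # index is clean again
```

With the real data, `sum(len(d) for d in dfs)` should likewise equal `len(combined_df)` (81,504 here).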
# make column names uniform
df.columns = df.columns.str.lower().str.replace(' ', '_')
df.columns
Index(['date', 'code', 'name', '12m_low', '12m_high', 'day_low', 'day_high',
'day_price', 'previous', 'change', 'change%', 'volume', 'adjusted'],
dtype='object')
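The header normalization in isolation, on toy column names, to show exactly what the chained string methods produce:

```python
import pandas as pd

# Lower-case and snake_case the headers, as in the cell above.
toy = pd.DataFrame(columns=['DATE', '12m Low', 'Day Price'])
toy.columns = toy.columns.str.lower().str.replace(' ', '_')
assert list(toy.columns) == ['date', '12m_low', 'day_price']
```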
numerical_columns = ['12m_low', '12m_high', 'day_low', 'day_high', 'day_price', 'previous', 'change', 'volume', 'adjusted']
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 81504 entries, 0 to 81503 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 81504 non-null object 1 code 81504 non-null object 2 name 81456 non-null object 3 12m_low 81504 non-null object 4 12m_high 81504 non-null object 5 day_low 81504 non-null object 6 day_high 81504 non-null object 7 day_price 81504 non-null object 8 previous 81504 non-null object 9 change 81504 non-null object 10 change% 81504 non-null object 11 volume 81504 non-null object 12 adjusted 81504 non-null object dtypes: object(13) memory usage: 8.1+ MB
# Converting specified columns
# Replace commas with empty strings in the selected columns
df[numerical_columns] = df[numerical_columns].replace({',': ''}, regex=True)
# Convert to numeric
df[numerical_columns] = df[numerical_columns].apply(pd.to_numeric, errors='coerce')
# Remove '%' symbol, handle invalid values, and convert to decimal
df['change%'] = pd.to_numeric(df['change%'].str.replace('%', '', regex=True), errors='coerce') / 100
numerical_columns.append('change%')
# Convert the 'DATE' column to datetime
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y', errors='coerce')
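Both coercions above silently turn unparseable values into NaN/NaT, so it is worth spot-checking them; a toy sketch of the same conversions:

```python
import pandas as pd

# Percent strings to decimals: '-' placeholders become NaN under coerce.
pct = pd.Series(['1.50%', '-', '3.77%'])
pct = pd.to_numeric(pct.str.replace('%', '', regex=False), errors='coerce') / 100
assert pd.isna(pct[1])

# M/D/YYYY strings to datetimes: malformed entries become NaT.
dates = pd.to_datetime(pd.Series(['1/2/2007', 'bad']),
                       format='%m/%d/%Y', errors='coerce')
assert dates[0] == pd.Timestamp('2007-01-02') and pd.isna(dates[1])
```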
df
| date | code | name | 12m_low | 12m_high | day_low | day_high | day_price | previous | change | change% | volume | adjusted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2007-01-02 | EGAD | Eaagads | 22.0 | 57.0 | 52.0 | 52.0 | 52.0 | 52.00 | NaN | NaN | 300 | 26.00 |
| 1 | 2007-01-02 | KAPC | Kapchorua Tea | 111.0 | 148.0 | 100.0 | 100.0 | 100.0 | 98.50 | 1.50 | 0.0150 | 7800 | NaN |
| 2 | 2007-01-02 | KUKZ | Kakuzi | 67.5 | 89.0 | 42.5 | 45.0 | 43.5 | 42.25 | 1.25 | 0.0287 | 2700 | NaN |
| 3 | 2007-01-02 | REA | Rea Vipingo Plantations | 14.5 | 23.5 | 25.0 | 26.5 | 25.5 | 25.75 | -0.25 | 0.0098 | 10500 | NaN |
| 4 | 2007-01-02 | SASN | Sasini Tea and Coffee | 10.5 | 13.6 | 140.0 | 145.0 | 140.0 | 141.00 | -1.00 | 0.0071 | 70900 | 23.33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 81499 | 2012-12-31 | REA | Rea Vipingo Plantations | 14.5 | 23.5 | 19.0 | 19.0 | 19.0 | 19.00 | NaN | NaN | 1000 | NaN |
| 81500 | 2012-12-31 | LIMT | Limuru Tea | 400.0 | 500.0 | 430.0 | 430.0 | 430.0 | 430.00 | NaN | NaN | 0 | NaN |
| 81501 | 2012-12-31 | KUKZ | Kakuzi | 67.5 | 89.0 | 72.0 | 72.0 | 72.0 | 69.00 | 3.00 | 0.0417 | 1000 | NaN |
| 81502 | 2012-12-31 | KAPC | Kapchorua Tea | 111.0 | 148.0 | 118.0 | 118.0 | 118.0 | 118.00 | NaN | NaN | 100 | NaN |
| 81503 | 2012-12-31 | EGAD | Eaagads | 22.0 | 57.0 | 25.0 | 25.0 | 25.0 | 25.00 | NaN | NaN | 0 | NaN |
81504 rows × 13 columns
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 81504 entries, 0 to 81503 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 81504 non-null datetime64[ns] 1 code 81504 non-null object 2 name 81456 non-null object 3 12m_low 81504 non-null float64 4 12m_high 81504 non-null float64 5 day_low 81504 non-null float64 6 day_high 81504 non-null float64 7 day_price 81504 non-null float64 8 previous 81492 non-null float64 9 change 51141 non-null float64 10 change% 51131 non-null float64 11 volume 81504 non-null int64 12 adjusted 20509 non-null float64 dtypes: datetime64[ns](1), float64(9), int64(1), object(2) memory usage: 8.1+ MB
Data Cleaning¶
# duplicates
df.duplicated().sum()
19
#dropping duplicates
df = df.drop_duplicates()
df.duplicated().sum()
0
df.isna().sum()
date 0 code 0 name 48 12m_low 0 12m_high 0 day_low 0 day_high 0 day_price 0 previous 12 change 30344 change% 30354 volume 0 adjusted 60979 dtype: int64
# dataframe where 'previous' column has nulls
df[df['previous'].isna()]
| date | code | name | 12m_low | 12m_high | day_low | day_high | day_price | previous | change | change% | volume | adjusted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15053 | 2008-05-30 | HFCK-R | NaN | 11.05 | 11.05 | 5.00 | 20.00 | 6.55 | NaN | 6.55 | 1.0 | 17200 | NaN |
| 22070 | 2009-02-02 | KPLC-P7 | Kenya Power and Lighting 7% | 5.50 | 5.50 | 5.00 | 5.00 | 5.00 | NaN | 5.00 | 1.0 | 0 | NaN |
| 22079 | 2009-02-02 | CARB | Carbacid Investments | 88.50 | 140.00 | 137.00 | 137.00 | 137.00 | NaN | 137.00 | 1.0 | 0 | 45.67 |
| 22080 | 2009-02-02 | BOC | B.O.C Kenya | 90.00 | 130.00 | 160.00 | 160.00 | 160.00 | NaN | 160.00 | 1.0 | 0 | NaN |
| 22099 | 2009-02-02 | UCHM | Uchumi Supermarket | 12.70 | 24.00 | 14.50 | 14.50 | 14.50 | NaN | 14.50 | 1.0 | 0 | NaN |
| 44813 | 2010-08-16 | TPSE-R | TPS Eastern Africa (Serena) (Rights) | 5.10 | 5.10 | 5.00 | 8.00 | 5.40 | NaN | 5.40 | 1.0 | 700 | NaN |
| 45711 | 2010-09-07 | SCBK-R | Standard Chartered Bank (Rights) | 51.00 | 89.00 | 70.00 | 70.00 | 70.00 | NaN | 70.00 | 1.0 | 500 | NaN |
| 49242 | 2010-12-01 | KPLC-R | Kenya Power and Lighting (Rights) | 0.55 | 0.55 | 4.00 | 4.50 | 4.00 | NaN | 4.00 | 1.0 | 11100 | NaN |
| 58420 | 2011-07-14 | TCL | Trans-Century | 20.00 | 37.75 | 50.00 | 60.00 | 57.00 | NaN | 57.00 | 1.0 | 1140000 | NaN |
| 69653 | 2012-04-03 | KQ-R | Kenya Airways (Rights) | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | NaN | 0.05 | 1.0 | 228200 | NaN |
| 72010 | 2012-05-30 | LKL | Longhorn Kenya | 8.50 | 21.00 | 15.00 | 20.00 | 16.15 | NaN | 16.15 | 1.0 | 4200 | NaN |
| 74136 | 2012-07-18 | CIC | CIC Insurance Group | 3.05 | 7.00 | 4.00 | 7.00 | 5.25 | NaN | 5.25 | 1.0 | 1620000 | NaN |
# Drop rows where 'Previous' column is null
df = df[df['previous'].notna()]
df.isna().sum()
date 0 code 0 name 47 12m_low 0 12m_high 0 day_low 0 day_high 0 day_price 0 previous 0 change 30344 change% 30354 volume 0 adjusted 60968 dtype: int64
# dataframe where 'name' column has nulls
null_name = df[df['name'].isna()]
null_name
| date | code | name | 12m_low | 12m_high | day_low | day_high | day_price | previous | change | change% | volume | adjusted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7359 | 2007-09-05 | OCH-R | NaN | 0.60 | 0.60 | 0.50 | 1.00 | 0.95 | 19.20 | -18.25 | 19.2105 | 253000 | NaN |
| 7403 | 2007-09-06 | OCH-R | NaN | 0.60 | 0.60 | 0.45 | 1.00 | 0.45 | 0.95 | -0.50 | 1.1111 | 511800 | NaN |
| 7448 | 2007-09-07 | OCH-R | NaN | 0.60 | 0.60 | 0.05 | 0.50 | 0.20 | 0.45 | -0.25 | 1.2500 | 404800 | NaN |
| 7492 | 2007-09-10 | OCH-R | NaN | 0.60 | 0.60 | 0.15 | 0.70 | 0.40 | 0.20 | 0.20 | 0.5000 | 272200 | NaN |
| 7537 | 2007-09-11 | OCH-R | NaN | 0.60 | 0.60 | 0.20 | 1.00 | 0.20 | 0.40 | -0.20 | 1.0000 | 5170000 | NaN |
| 7583 | 2007-09-12 | OCH-R | NaN | 0.60 | 0.60 | 0.25 | 1.00 | 0.55 | 0.20 | 0.35 | 0.6364 | 329000 | NaN |
| 7627 | 2007-09-13 | OCH-R | NaN | 0.60 | 0.60 | 0.10 | 1.00 | 0.40 | 0.55 | -0.15 | 0.3750 | 3040000 | NaN |
| 7671 | 2007-09-14 | OCH-R | NaN | 0.60 | 0.60 | 0.40 | 2.00 | 0.60 | 0.40 | 0.20 | 0.3333 | 3650000 | NaN |
| 8723 | 2007-10-22 | NIC-R | NaN | 5.05 | 98.00 | 54.50 | 55.00 | 54.50 | 191.00 | -136.50 | 2.5046 | 1000 | NaN |
| 8764 | 2007-10-23 | NIC-R | NaN | 5.05 | 98.00 | 90.00 | 100.00 | 95.00 | 54.50 | 40.50 | 0.4263 | 5100 | NaN |
| 8807 | 2007-10-24 | NIC-R | NaN | 5.05 | 98.00 | 95.00 | 100.00 | 99.00 | 95.00 | 4.00 | 0.0404 | 11200 | NaN |
| 8850 | 2007-10-25 | NIC-R | NaN | 5.05 | 98.00 | 95.00 | 100.00 | 96.00 | 99.00 | -3.00 | 0.0313 | 8500 | NaN |
| 8894 | 2007-10-26 | NIC-R | NaN | 5.05 | 98.00 | 87.00 | 95.00 | 89.00 | 96.00 | -7.00 | 0.0787 | 4600 | NaN |
| 8940 | 2007-10-29 | NIC-R | NaN | 5.05 | 98.00 | 84.00 | 90.00 | 85.00 | 89.00 | -4.00 | 0.0471 | 10700 | NaN |
| 8985 | 2007-10-30 | NIC-R | NaN | 5.05 | 98.00 | 85.50 | 95.00 | 93.50 | 85.00 | 8.50 | 0.0909 | 5000 | NaN |
| 9029 | 2007-10-31 | NIC-R | NaN | 5.05 | 98.00 | 93.00 | 100.00 | 95.00 | 93.50 | 1.50 | 0.0158 | 142100 | NaN |
| 9071 | 2007-11-01 | NIC-R | NaN | 5.05 | 98.00 | 90.00 | 100.00 | 92.00 | 95.00 | -3.00 | 0.0326 | 35500 | NaN |
| 9118 | 2007-11-02 | NIC-R | NaN | 5.05 | 98.00 | 90.00 | 100.00 | 98.00 | 92.00 | 6.00 | 0.0612 | 13000 | NaN |
| 9195 | 2007-11-06 | DTK-R | NaN | 2.30 | 50.00 | 9.00 | 9.00 | 9.00 | 15.70 | -6.70 | 0.7444 | 4100 | NaN |
| 9239 | 2007-11-07 | DTK-R | NaN | 2.30 | 50.00 | 26.50 | 26.50 | 26.50 | 9.00 | 17.50 | 0.6604 | 16000 | NaN |
| 9283 | 2007-11-08 | DTK-R | NaN | 2.30 | 50.00 | 12.00 | 26.00 | 23.50 | 26.50 | -3.00 | 0.1277 | 8800 | NaN |
| 9325 | 2007-11-09 | DTK-R | NaN | 2.30 | 50.00 | 12.00 | 15.00 | 12.50 | 23.50 | -11.00 | 0.8800 | 8600 | NaN |
| 9368 | 2007-11-12 | DTK-R | NaN | 2.30 | 50.00 | 12.00 | 14.00 | 12.05 | 12.50 | -0.45 | 0.0373 | 64100 | NaN |
| 15097 | 2008-06-03 | HFCK-R | NaN | 11.05 | 11.05 | 8.00 | 11.00 | 10.05 | 6.55 | 3.50 | 0.3483 | 24200 | NaN |
| 15141 | 2008-06-04 | HFCK-R | NaN | 11.05 | 11.05 | 8.00 | 15.00 | 11.85 | 10.05 | 1.80 | 0.1519 | 131900 | NaN |
| 15186 | 2008-06-05 | HFCK-R | NaN | 11.05 | 11.05 | 11.00 | 15.00 | 13.05 | 11.85 | 1.20 | 0.0920 | 5820000 | NaN |
| 15230 | 2008-06-06 | HFCK-R | NaN | 11.05 | 11.05 | 8.00 | 14.00 | 9.70 | 13.05 | -3.35 | 0.3454 | 17540000 | NaN |
| 15272 | 2008-06-09 | HFCK-R | NaN | 11.05 | 11.05 | 9.00 | 13.50 | 9.25 | 9.70 | -0.45 | 0.0486 | 162500 | NaN |
| 15316 | 2008-06-10 | HFCK-R | NaN | 11.05 | 11.05 | 9.55 | 11.00 | 10.45 | 9.25 | 1.20 | 0.1148 | 118200 | NaN |
| 15359 | 2008-06-11 | HFCK-R | NaN | 11.05 | 11.05 | 10.00 | 12.00 | 10.40 | 10.45 | -0.05 | 0.0048 | 137100 | NaN |
| 15404 | 2008-06-12 | HFCK-R | NaN | 11.05 | 11.05 | 10.05 | 13.25 | 11.65 | 10.40 | 1.25 | 0.1073 | 254700 | NaN |
| 15448 | 2008-06-13 | HFCK-R | NaN | 11.05 | 11.05 | 10.00 | 12.50 | 11.05 | 11.65 | -0.60 | 0.0543 | 342600 | NaN |
| 15708 | 2008-06-23 | KCB-R | NaN | 0.25 | 0.25 | 5.60 | 6.80 | 6.60 | 217.00 | -210.40 | 31.8788 | 86700 | NaN |
| 15753 | 2008-06-24 | KCB-R | NaN | 0.25 | 0.25 | 5.50 | 6.80 | 5.75 | 6.60 | -0.85 | 0.1478 | 147700 | NaN |
| 15797 | 2008-06-25 | KCB-R | NaN | 0.25 | 0.25 | 5.50 | 31.25 | 9.85 | 5.75 | 4.10 | 0.4162 | 76200 | NaN |
| 15843 | 2008-06-26 | KCB-R | NaN | 0.25 | 0.25 | 4.35 | 30.00 | 6.35 | 9.85 | -3.50 | 0.5512 | 93800 | NaN |
| 15888 | 2008-06-27 | KCB-R | NaN | 0.25 | 0.25 | 1.00 | 34.75 | 5.25 | 6.35 | -1.10 | 0.2095 | 1010000 | NaN |
| 15933 | 2008-06-30 | KCB-R | NaN | 0.25 | 0.25 | 5.00 | 6.00 | 5.35 | 5.25 | 0.10 | 0.0187 | 109100 | NaN |
| 15979 | 2008-07-01 | KCB-R | NaN | 0.25 | 0.25 | 5.05 | 6.50 | 5.35 | 5.35 | NaN | NaN | 133700 | NaN |
| 16026 | 2008-07-02 | KCB-R | NaN | 0.25 | 0.25 | 5.05 | 6.00 | 5.45 | 5.35 | 0.10 | 0.0183 | 239200 | NaN |
| 16074 | 2008-07-03 | KCB-R | NaN | 0.25 | 0.25 | 4.35 | 31.25 | 5.05 | 5.45 | -0.40 | 0.0792 | 3790000 | NaN |
| 16122 | 2008-07-04 | KCB-R | NaN | 0.25 | 0.25 | 5.00 | 5.50 | 5.25 | 5.05 | 0.20 | 0.0381 | 165300 | NaN |
| 16170 | 2008-07-07 | KCB-R | NaN | 0.25 | 0.25 | 3.00 | 5.35 | 5.15 | 5.25 | -0.10 | 0.0194 | 246800 | NaN |
| 16218 | 2008-07-08 | KCB-R | NaN | 0.25 | 0.25 | 4.80 | 5.30 | 5.00 | 5.15 | -0.15 | 0.0300 | 45720000 | NaN |
| 16265 | 2008-07-09 | KCB-R | NaN | 0.25 | 0.25 | 5.05 | 6.00 | 5.10 | 5.00 | 0.10 | 0.0196 | 13580000 | NaN |
| 16311 | 2008-07-10 | KCB-R | NaN | 0.25 | 0.25 | 5.20 | 9.60 | 5.45 | 5.10 | 0.35 | 0.0642 | 448300 | NaN |
| 16357 | 2008-07-11 | KCB-R | NaN | 0.25 | 0.25 | 5.50 | 7.00 | 5.95 | 5.45 | 0.50 | 0.0840 | 4720000 | NaN |
# getting associated codes for null names
null_name['code'].unique()
array(['OCH-R', 'NIC-R', 'DTK-R', 'HFCK-R', 'KCB-R'], dtype=object)
# dataframe with associated codes for null names
null_name_codes = df[df['code'].isin(['OCH-R', 'NIC-R', 'DTK-R', 'HFCK-R', 'KCB-R'])]
null_name_codes
| date | code | name | 12m_low | 12m_high | day_low | day_high | day_price | previous | change | change% | volume | adjusted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7359 | 2007-09-05 | OCH-R | NaN | 0.60 | 0.6 | 0.50 | 1.0 | 0.95 | 19.20 | -18.25 | 19.2105 | 253000 | NaN |
| 7403 | 2007-09-06 | OCH-R | NaN | 0.60 | 0.6 | 0.45 | 1.0 | 0.45 | 0.95 | -0.50 | 1.1111 | 511800 | NaN |
| 7448 | 2007-09-07 | OCH-R | NaN | 0.60 | 0.6 | 0.05 | 0.5 | 0.20 | 0.45 | -0.25 | 1.2500 | 404800 | NaN |
| 7492 | 2007-09-10 | OCH-R | NaN | 0.60 | 0.6 | 0.15 | 0.7 | 0.40 | 0.20 | 0.20 | 0.5000 | 272200 | NaN |
| 7537 | 2007-09-11 | OCH-R | NaN | 0.60 | 0.6 | 0.20 | 1.0 | 0.20 | 0.40 | -0.20 | 1.0000 | 5170000 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 76209 | 2012-09-03 | NIC-R | NIC Bank (Rights) | 5.05 | 98.0 | 9.05 | 13.0 | 10.40 | 10.00 | 0.40 | 0.0385 | 149900 | NaN |
| 76274 | 2012-09-04 | NIC-R | NIC Bank (Rights) | 5.05 | 98.0 | 9.40 | 13.0 | 11.15 | 10.40 | 0.75 | 0.0673 | 121700 | NaN |
| 76339 | 2012-09-05 | NIC-R | NIC Bank (Rights) | 5.05 | 98.0 | 12.50 | 15.0 | 14.55 | 11.15 | 3.40 | 0.2337 | 248100 | NaN |
| 76404 | 2012-09-06 | NIC-R | NIC Bank (Rights) | 5.05 | 98.0 | 12.00 | 14.5 | 13.05 | 14.55 | -1.50 | 0.1149 | 486500 | NaN |
| 76469 | 2012-09-07 | NIC-R | NIC Bank (Rights) | 5.05 | 98.0 | 10.05 | 13.8 | 12.85 | 13.05 | -0.20 | 0.0156 | 1290000 | NaN |
74 rows × 13 columns
# unique names for the associated codes
null_name_codes['name'].unique()
array([nan, 'Kenya Commercial Bank (Rights)',
'Diamond Trust Bank Kenya (Rights)', 'NIC Bank (Rights)'],
dtype=object)
# Create the dictionary for replacements
replacement_dict = {
'NIC-R': 'NIC Bank (Rights)',
'DTK-R': 'Diamond Trust Bank Kenya (Rights)',
'KCB-R': 'Kenya Commercial Bank (Rights)',
'OCH-R': 'Olympia Capital Holdings Ltd (Rights)',
'HFCK-R': 'HF Group Ltd (Rights)'
}
# Iterate through the replacement dictionary
for code, replacement_name in replacement_dict.items():
# Update the 'NAME' column where the 'CODE' matches and 'NAME' is null
df.loc[(df['code'] == code) & (df['name'].isna()), 'name'] = replacement_name
df.isna().sum()
date 0 code 0 name 0 12m_low 0 12m_high 0 day_low 0 day_high 0 day_price 0 previous 0 change 30344 change% 30354 volume 0 adjusted 60968 dtype: int64
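The loop above can also be written as a single vectorized fill: map each code to its name and use the result only where 'name' is missing. A sketch on toy rows, reusing part of the replacement_dict idea:

```python
import pandas as pd

# Map codes to names and fill only the null 'name' entries.
replacement_dict = {
    'NIC-R': 'NIC Bank (Rights)',
    'KCB-R': 'Kenya Commercial Bank (Rights)',
}
toy = pd.DataFrame({'code': ['NIC-R', 'KCB-R', 'EGAD'],
                    'name': [None, 'Kenya Commercial Bank (Rights)', 'Eaagads']})
toy['name'] = toy['name'].fillna(toy['code'].map(replacement_dict))
assert toy['name'].tolist() == ['NIC Bank (Rights)',
                                'Kenya Commercial Bank (Rights)', 'Eaagads']
```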
# dataframe where 'change' column has nulls
df[df['change'].isna()]
| date | code | name | 12m_low | 12m_high | day_low | day_high | day_price | previous | change | change% | volume | adjusted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2007-01-02 | EGAD | Eaagads | 22.00 | 57.00 | 52.0 | 52.0 | 52.0 | 52.0 | NaN | NaN | 300 | 26.00 |
| 7 | 2007-01-02 | C&G | Car and General (K) | 21.00 | 29.00 | 50.0 | 50.0 | 50.0 | 50.0 | NaN | NaN | 66700 | 33.33 |
| 16 | 2007-01-02 | NBK | National Bank of Kenya | 16.25 | 24.25 | 57.0 | 60.0 | 58.5 | 58.5 | NaN | NaN | 84900 | 41.79 |
| 17 | 2007-01-02 | NIC | NIC Bank | 22.95 | 58.00 | 101.0 | 105.0 | 102.0 | 102.0 | NaN | NaN | 9800 | 23.22 |
| 18 | 2007-01-02 | SCBK | Standard Chartered Bank | 160.00 | 320.00 | 202.0 | 210.0 | 205.0 | 205.0 | NaN | NaN | 54400 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 81498 | 2012-12-31 | SASN | Sasini Tea and Coffee | 10.50 | 13.60 | 11.7 | 12.0 | 11.7 | 11.7 | NaN | NaN | 32200 | NaN |
| 81499 | 2012-12-31 | REA | Rea Vipingo Plantations | 14.50 | 23.50 | 19.0 | 19.0 | 19.0 | 19.0 | NaN | NaN | 1000 | NaN |
| 81500 | 2012-12-31 | LIMT | Limuru Tea | 400.00 | 500.00 | 430.0 | 430.0 | 430.0 | 430.0 | NaN | NaN | 0 | NaN |
| 81502 | 2012-12-31 | KAPC | Kapchorua Tea | 111.00 | 148.00 | 118.0 | 118.0 | 118.0 | 118.0 | NaN | NaN | 100 | NaN |
| 81503 | 2012-12-31 | EGAD | Eaagads | 22.00 | 57.00 | 25.0 | 25.0 | 25.0 | 25.0 | NaN | NaN | 0 | NaN |
30344 rows × 13 columns
# replacing nulls with zero to indicate no change
df['change'] = df['change'].fillna(0)
df['change%'] = df['change%'].fillna(0)
df.isna().sum()
date 0 code 0 name 0 12m_low 0 12m_high 0 day_low 0 day_high 0 day_price 0 previous 0 change 0 change% 0 volume 0 adjusted 60968 dtype: int64
df['adjusted'].isna().sum()/len(df) * 100
74.83215298319689
Approximately 75% of the adjusted column is null, so it is safe to drop this column.
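The same rule can be written generically: drop any column whose null share exceeds a chosen threshold. A sketch on toy data (the 50% threshold here is purely illustrative):

```python
import numpy as np
import pandas as pd

# Keep only columns whose null share is at or below the threshold.
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.nan, np.nan, np.nan, 4.0]})
threshold = 0.5
keep = toy.columns[toy.isna().mean() <= threshold]
toy = toy[keep]
assert list(toy.columns) == ['a']   # 'b' is 75% null, so it is dropped
```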
# Drop adjusted column
del df['adjusted']
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 81473 entries, 0 to 81503 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 81473 non-null datetime64[ns] 1 code 81473 non-null object 2 name 81473 non-null object 3 12m_low 81473 non-null float64 4 12m_high 81473 non-null float64 5 day_low 81473 non-null float64 6 day_high 81473 non-null float64 7 day_price 81473 non-null float64 8 previous 81473 non-null float64 9 change 81473 non-null float64 10 change% 81473 non-null float64 11 volume 81473 non-null int64 dtypes: datetime64[ns](1), float64(8), int64(1), object(2) memory usage: 8.1+ MB
df
| date | code | name | 12m_low | 12m_high | day_low | day_high | day_price | previous | change | change% | volume | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2007-01-02 | EGAD | Eaagads | 22.0 | 57.0 | 52.0 | 52.0 | 52.0 | 52.00 | 0.00 | 0.0000 | 300 |
| 1 | 2007-01-02 | KAPC | Kapchorua Tea | 111.0 | 148.0 | 100.0 | 100.0 | 100.0 | 98.50 | 1.50 | 0.0150 | 7800 |
| 2 | 2007-01-02 | KUKZ | Kakuzi | 67.5 | 89.0 | 42.5 | 45.0 | 43.5 | 42.25 | 1.25 | 0.0287 | 2700 |
| 3 | 2007-01-02 | REA | Rea Vipingo Plantations | 14.5 | 23.5 | 25.0 | 26.5 | 25.5 | 25.75 | -0.25 | 0.0098 | 10500 |
| 4 | 2007-01-02 | SASN | Sasini Tea and Coffee | 10.5 | 13.6 | 140.0 | 145.0 | 140.0 | 141.00 | -1.00 | 0.0071 | 70900 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 81499 | 2012-12-31 | REA | Rea Vipingo Plantations | 14.5 | 23.5 | 19.0 | 19.0 | 19.0 | 19.00 | 0.00 | 0.0000 | 1000 |
| 81500 | 2012-12-31 | LIMT | Limuru Tea | 400.0 | 500.0 | 430.0 | 430.0 | 430.0 | 430.00 | 0.00 | 0.0000 | 0 |
| 81501 | 2012-12-31 | KUKZ | Kakuzi | 67.5 | 89.0 | 72.0 | 72.0 | 72.0 | 69.00 | 3.00 | 0.0417 | 1000 |
| 81502 | 2012-12-31 | KAPC | Kapchorua Tea | 111.0 | 148.0 | 118.0 | 118.0 | 118.0 | 118.00 | 0.00 | 0.0000 | 100 |
| 81503 | 2012-12-31 | EGAD | Eaagads | 22.0 | 57.0 | 25.0 | 25.0 | 25.0 | 25.00 | 0.00 | 0.0000 | 0 |
81473 rows × 12 columns
EDA (Exploratory Data Analysis)¶
We take 'day_price' as our target variable.
numericals = ['12m_low', '12m_high', 'day_low', 'day_high',
'day_price', 'previous', 'change', 'change%', 'volume']
categorical = ['code']
# Visualize the trend of 'day_price' across the years
fig = px.line(df, x='date', y='day_price')
fig.update_layout(title='Daily Day Price', title_x=0.5, title_y=0.95) # center the title
df2 = df.copy()
# Set the 'date' column as the index
df2.set_index('date', inplace=True)
# Now resample 'day_price' by month and compute the mean
monthly_avg = df2['day_price'].resample('ME').mean()  # 'ME' = month-end ('M' is deprecated)
# Create a line plot with monthly averages
fig = px.line(monthly_avg, x=monthly_avg.index, y=monthly_avg, labels={'x': 'Date', 'y': 'Average Day Price'})
fig.update_layout(title='Monthly Average Day Price', title_x=0.5, title_y=0.95 ) # Center the title (0.5 means centered)
# Show the plot
fig.show()
px.histogram(df, x='day_price')
∴ Prices follow a long-tailed distribution: most stocks trade at low prices, with a few expensive outliers
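The gap between the mean and the median gives a quick numeric confirmation of that skew. A toy sketch, with log-normal data standing in for the price column (the distribution choice is an illustrative assumption, not derived from the NSE data):

```python
import numpy as np

# Log-normal data mimics the long right tail seen in the price histogram
rng = np.random.default_rng(1)
prices = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)

# In a right-skewed distribution the mean sits well above the median
print(np.mean(prices) > np.median(prices))  # True
```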
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Create subplots for each numerical column
fig = make_subplots(rows=3, cols=3, subplot_titles=[f'Boxplot for {col}' for col in numericals])
# Loop through each numerical column and create a boxplot
for i, column in enumerate(numericals, 1):
row = (i - 1) // 3 + 1 # Determine the row index (1 to 3)
col = (i - 1) % 3 + 1 # Determine the column index (1 to 3)
fig.add_trace(
go.Box(
y=df[column], # Data for the boxplot
name=column # Name of the boxplot (will appear in the title)
),
row=row, col=col # Specify the location of the subplot
)
# Update the layout to add a title and improve spacing
fig.update_layout(
title_text="Boxplots for Numerical Columns", # Title for the entire plot
title_x=0.5, # Center the title
height=900, # Adjust the height of the plot
showlegend=False # Hide the legend (not needed for boxplots)
)
# Show the plot
fig.show()
Handling Outliers¶
Winsorization reduces the effect of outliers by capping each numerical column at its 20th and 80th percentiles: values below the 20th percentile are raised to it, and values above the 80th are lowered to it. This makes the data less sensitive to extreme values while retaining the shape of the overall distribution.
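A minimal illustration of what `mstats.winsorize` does with `limits=[0.2, 0.2]`: on ten values, the two smallest are raised to the third-smallest value and the two largest lowered to the third-largest.

```python
import numpy as np
from scipy.stats import mstats

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
capped = mstats.winsorize(data, limits=[0.2, 0.2])

# 1 and 2 are pulled up to 3; 9 and 100 are pulled down to 8
print(capped.tolist())  # [3.0, 3.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 8.0, 8.0]
```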
from scipy.stats import mstats
df3 = df.copy()
# Winsorize each numerical column
for column in numericals:
    df3[column] = mstats.winsorize(df3[column], limits=[0.2, 0.2])  # cap at the 20th and 80th percentiles
# Create subplots for each numerical column
fig = make_subplots(rows=3, cols=3, subplot_titles=[f'Boxplot for {col}' for col in numericals])
# Loop through each numerical column and create a boxplot
for i, column in enumerate(numericals, 1):
row = (i - 1) // 3 + 1 # Determine the row index (1 to 3)
col = (i - 1) % 3 + 1 # Determine the column index (1 to 3)
fig.add_trace(
go.Box(
y=df3[column], # Data for the boxplot
name=column # Name of the boxplot (will appear in the title)
),
row=row, col=col # Specify the location of the subplot
)
# Update the layout to add a title and improve spacing
fig.update_layout(
title_text="Boxplots for Winsorized Numerical Columns", # Title for the entire plot
title_x=0.5, # Center the title
height=900, # Adjust the height of the plot
showlegend=False # Hide the legend (not needed for boxplots)
)
# Show the plot
fig.show()
Model Training¶
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
# Display the winsorized dataset
df3
| date | code | name | 12m_low | 12m_high | day_low | day_high | day_price | previous | change | change% | volume | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2007-01-02 | EGAD | Eaagads | 22.0 | 57.0 | 52.0 | 52.0 | 52.0 | 52.00 | 0.00 | 0.0000 | 300 |
| 1 | 2007-01-02 | KAPC | Kapchorua Tea | 88.5 | 148.0 | 100.0 | 100.0 | 100.0 | 98.50 | 0.25 | 0.0150 | 7800 |
| 2 | 2007-01-02 | KUKZ | Kakuzi | 67.5 | 89.0 | 42.5 | 45.0 | 43.5 | 42.25 | 0.25 | 0.0230 | 2700 |
| 3 | 2007-01-02 | REA | Rea Vipingo Plantations | 14.5 | 23.5 | 25.0 | 26.5 | 25.5 | 25.75 | -0.25 | 0.0098 | 10500 |
| 4 | 2007-01-02 | SASN | Sasini Tea and Coffee | 10.5 | 13.6 | 125.0 | 129.0 | 127.0 | 127.00 | -0.25 | 0.0071 | 70900 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 81499 | 2012-12-31 | REA | Rea Vipingo Plantations | 14.5 | 23.5 | 19.0 | 19.0 | 19.0 | 19.00 | 0.00 | 0.0000 | 1000 |
| 81500 | 2012-12-31 | LIMT | Limuru Tea | 88.5 | 148.0 | 125.0 | 129.0 | 127.0 | 127.00 | 0.00 | 0.0000 | 200 |
| 81501 | 2012-12-31 | KUKZ | Kakuzi | 67.5 | 89.0 | 72.0 | 72.0 | 72.0 | 69.00 | 0.25 | 0.0230 | 1000 |
| 81502 | 2012-12-31 | KAPC | Kapchorua Tea | 88.5 | 148.0 | 118.0 | 118.0 | 118.0 | 118.00 | 0.00 | 0.0000 | 200 |
| 81503 | 2012-12-31 | EGAD | Eaagads | 22.0 | 57.0 | 25.0 | 25.0 | 25.0 | 25.00 | 0.00 | 0.0000 | 200 |
81473 rows × 12 columns
# Define X and Y
numericals = ['12m_low', '12m_high', 'day_low', 'day_high','previous','volume']
x = df3[numericals].values
y = df3['day_price'].values
# Split the dataset into training and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
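One caveat worth noting: `train_test_split` shuffles rows, so observations from 2012 can land in the training set while 2007 rows land in the test set. For a stricter chronological evaluation, scikit-learn's `TimeSeriesSplit` keeps every test fold strictly after its training fold. A small sketch on synthetic data (an alternative to the random split above, not what the notebook uses):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Synthetic "chronological" data: 20 rows already ordered by time
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # Every test index comes strictly after every training index
    assert train_idx.max() < test_idx.min()
```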
Training and evaluating different models
# Scale the features so each has:
# - a mean of 0
# - a standard deviation of 1
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
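A tiny numeric check of what `StandardScaler` does to a column, i.e. z = (x − mean) / std:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
Z = StandardScaler().fit_transform(X)

# The transformed column has mean 0 and (population) standard deviation 1
print(np.allclose(Z.mean(), 0.0) and np.allclose(Z.std(), 1.0))  # True
```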
# Initialize models
models = {
'Linear Regression': LinearRegression(),
'Ridge Regression': Ridge(alpha=1.0),
'KNN Regression': KNeighborsRegressor(n_neighbors=5),
'Decision Tree Regression': DecisionTreeRegressor(),
}
# Evaluate models
results = []
for name, model in models.items():
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
results.append([name, mae, mse, rmse, r2])
# Display results
results_df = pd.DataFrame(results, columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2'])
results_df.set_index('Model', inplace=True)
results_df
| MAE | MSE | RMSE | R2 | |
|---|---|---|---|---|
| Model | ||||
| Linear Regression | 0.175972 | 0.367902 | 0.606549 | 0.999823 |
| Ridge Regression | 0.176169 | 0.369004 | 0.607457 | 0.999822 |
| KNN Regression | 0.242779 | 0.673791 | 0.820848 | 0.999676 |
| Decision Tree Regression | 0.174512 | 0.436381 | 0.660591 | 0.999790 |
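For reference, the four metrics in the table reduce to simple formulas. A hand-checked toy example, with values chosen so the arithmetic is easy to verify:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])    # errors: -1, 0, +2

mae = mean_absolute_error(y_true, y_hat)   # (1 + 0 + 2) / 3 = 1.0
mse = mean_squared_error(y_true, y_hat)    # (1 + 0 + 4) / 3 ≈ 1.667
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_hat)               # 1 - SS_res/SS_tot = 1 - 5/8 = 0.375

print(mae, rmse, r2)
```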
# Plotly Bar graph
fig = px.bar(
results_df,
x=results_df.index,
y=['MAE', 'MSE', 'RMSE', 'R2'],
barmode='group',
title="Model Evaluation Metrics",
labels={"value": "Score", "Model": "Model"},
height=500
)
# Show the plot
fig.show()
# Find the best model based on the desired metric (R²)
best_model_r2 = results_df['R2'].idxmax() # Model with highest R²
best_model_mse = results_df['MSE'].idxmin() # Model with lowest MSE
best_model_mae = results_df['MAE'].idxmin() # Model with lowest MAE
best_model_rmse = results_df['RMSE'].idxmin() # Model with lowest RMSE
# Display the best models based on each metric
print(f"Best Model based on R²: {best_model_r2}")
print(f"Best Model based on MSE: {best_model_mse}")
print(f"Best Model based on MAE: {best_model_mae}")
print(f"Best Model based on RMSE: {best_model_rmse}")
Best Model based on R²: Linear Regression
Best Model based on MSE: Linear Regression
Best Model based on MAE: Decision Tree Regression
Best Model based on RMSE: Linear Regression
∴ Linear Regression is the best-performing model overall. (The near-perfect R² scores are expected: predictors such as day_low, day_high, and previous already bracket the closing price very tightly.)
# Re-initialize the two strongest models; they are refit on unscaled features here, which does not affect their predictions
lr_model = LinearRegression()
dt_model = DecisionTreeRegressor()
# Fit models to the training data
lr_model.fit(X_train, y_train)
dt_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred_lr = lr_model.predict(X_test)
y_pred_dt = dt_model.predict(X_test)
Error Plots
# Create a figure and axis
plt.figure(figsize=(14, 6))
# Plot predicted vs actual values for Linear Regression
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred_lr, color='blue', alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linestyle='--') # Line of perfect prediction
plt.title('Linear Regression: Predicted vs Actual')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
# Plot predicted vs actual values for Decision Tree Regressor
plt.subplot(1, 2, 2)
plt.scatter(y_test, y_pred_dt, color='green', alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linestyle='--') # Line of perfect prediction
plt.title('Decision Tree Regressor: Predicted vs Actual')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
# Display the plots
plt.tight_layout()
plt.show()
Our models fit well: most points lie along the line of perfect prediction (y = x).
Residual plots
# Calculate residuals for Linear Regression
residuals_lr = y_test - y_pred_lr
# Calculate residuals for Decision Tree Regressor
residuals_dt = y_test - y_pred_dt
# Create residual plot for Linear Regression
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
plt.scatter(y_pred_lr, residuals_lr, color='blue', alpha=0.5)
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residual Plot for Linear Regression')
plt.xlabel('Predicted Values (Linear Regression)')
plt.ylabel('Residuals')
# Create residual plot for Decision Tree Regressor
plt.subplot(1, 2, 2)
plt.scatter(y_pred_dt, residuals_dt, color='green', alpha=0.5)
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residual Plot for Decision Tree Regressor')
plt.xlabel('Predicted Values (Decision Tree Regressor)')
plt.ylabel('Residuals')
# Show the plots
plt.tight_layout()
plt.show()
Our models fit well: the residuals scatter randomly around the zero line with no obvious pattern.
Using the Model¶
df3
| date | code | name | 12m_low | 12m_high | day_low | day_high | day_price | previous | change | change% | volume | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2007-01-02 | EGAD | Eaagads | 22.0 | 57.0 | 52.0 | 52.0 | 52.0 | 52.00 | 0.00 | 0.0000 | 300 |
| 1 | 2007-01-02 | KAPC | Kapchorua Tea | 88.5 | 148.0 | 100.0 | 100.0 | 100.0 | 98.50 | 0.25 | 0.0150 | 7800 |
| 2 | 2007-01-02 | KUKZ | Kakuzi | 67.5 | 89.0 | 42.5 | 45.0 | 43.5 | 42.25 | 0.25 | 0.0230 | 2700 |
| 3 | 2007-01-02 | REA | Rea Vipingo Plantations | 14.5 | 23.5 | 25.0 | 26.5 | 25.5 | 25.75 | -0.25 | 0.0098 | 10500 |
| 4 | 2007-01-02 | SASN | Sasini Tea and Coffee | 10.5 | 13.6 | 125.0 | 129.0 | 127.0 | 127.00 | -0.25 | 0.0071 | 70900 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 81499 | 2012-12-31 | REA | Rea Vipingo Plantations | 14.5 | 23.5 | 19.0 | 19.0 | 19.0 | 19.00 | 0.00 | 0.0000 | 1000 |
| 81500 | 2012-12-31 | LIMT | Limuru Tea | 88.5 | 148.0 | 125.0 | 129.0 | 127.0 | 127.00 | 0.00 | 0.0000 | 200 |
| 81501 | 2012-12-31 | KUKZ | Kakuzi | 67.5 | 89.0 | 72.0 | 72.0 | 72.0 | 69.00 | 0.25 | 0.0230 | 1000 |
| 81502 | 2012-12-31 | KAPC | Kapchorua Tea | 88.5 | 148.0 | 118.0 | 118.0 | 118.0 | 118.00 | 0.00 | 0.0000 | 200 |
| 81503 | 2012-12-31 | EGAD | Eaagads | 22.0 | 57.0 | 25.0 | 25.0 | 25.0 | 25.00 | 0.00 | 0.0000 | 200 |
81473 rows × 12 columns
# Select the features from a single row (row index 4 here)
features = ['12m_low', '12m_high', 'day_low', 'day_high', 'previous', 'volume']
X_single_row = df3.loc[4, features].values.reshape(1, -1)  # reshape to a 2-D array for predict()
# Make a prediction using the trained model
y_pred_single = lr_model.predict(X_single_row)
# Get the actual price from the df3['day_price'] column for the same row
y_actual_single = df3.loc[4, 'day_price']
# Output the prediction and actual value
print(f"Predicted value : {y_pred_single[0]}")
print(f"Actual value : {y_actual_single}")
Predicted value : 126.91836872539568
Actual value : 127.0
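The lookup-and-predict step above can be wrapped in a small helper. The sketch below is self-contained on toy data; `predict_row` and the synthetic features are hypothetical, not taken from the notebook:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-in: the "closing price" is the midpoint of two low/high-style features
rng = np.random.default_rng(0)
X = rng.uniform(10.0, 100.0, size=(200, 2))
y = X.mean(axis=1)

model = LinearRegression().fit(X, y)

def predict_row(model, row):
    """Predict from a single 1-D feature row (hypothetical helper)."""
    return float(model.predict(np.asarray(row, dtype=float).reshape(1, -1))[0])

print(round(predict_row(model, [40.0, 60.0]), 2))  # 50.0
```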